[wip] Distributed Scion/Muon #1630
base: main
Conversation
Thanks for the PR on cutting-edge features!
I haven't read the papers, so please forgive me if my comments don't make sense.
I think for "core" changes such as this one on optimizers, the recommended path is to first land the work in pytorch/pytorch, and then expose minimal interfaces in torchtitan. torchtitan shouldn't be a place to host core features.
cc @janeyx99 on interesting optimizer work
Update: the init refactor is done. You can check the diff here. I have added the debug configs, so you can try it now. There is also a "clean" version where I removed the logging code, making it easier to read and understand. Note: in a random test, Scion uses a higher learning rate; Muon/Scion allows us to train a model with a high LR.
Most of the features are ready, so we can start reviewing them.
This is a distributed version of Scion (the Modular Norm approach); Muon can be considered a variant of it that uses explicit AdamW for the LLM's embedding and output layers.
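For readers unfamiliar with the split between the two update rules, here is a minimal, single-device sketch (not the distributed implementation in this PR): a Muon-style optimizer applies a momentum step followed by Newton-Schulz orthogonalization to 2D hidden weight matrices, while embedding/output (and other) parameters are handled by a separate AdamW instance. The names `SimpleMuon` and `newton_schulz_orthogonalize`, the coefficients, and the parameter grouping are illustrative assumptions, not the PR's API.

```python
# Hypothetical sketch only; torchtitan/this PR implement a distributed version.
import torch


def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D update via a Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # commonly cited Muon coefficients (assumption)
    x = g / (g.norm() + 1e-7)
    transposed = x.shape[0] > x.shape[1]
    if transposed:  # keep the smaller dimension first so A = x @ x.T stays small
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x


class SimpleMuon(torch.optim.Optimizer):
    """Minimal Muon-like optimizer: momentum buffer + orthogonalized update."""

    def __init__(self, params, lr=0.02, momentum=0.95):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "momentum_buffer" not in state:
                    state["momentum_buffer"] = torch.zeros_like(p)
                buf = state["momentum_buffer"]
                buf.mul_(group["momentum"]).add_(p.grad)
                update = newton_schulz_orthogonalize(buf)
                p.add_(update, alpha=-group["lr"])


# Parameter grouping: 2D hidden weights go to the Muon-like optimizer,
# embedding/output weights and biases go to AdamW (illustrative model).
model = torch.nn.Sequential(
    torch.nn.Embedding(1000, 64),
    torch.nn.Linear(64, 64),
    torch.nn.Linear(64, 1000),
)
embed_and_head = set(model[0].parameters()) | set(model[2].parameters())
muon_params = [p for p in model.parameters()
               if p.ndim == 2 and p not in embed_and_head]
adamw_params = [p for p in model.parameters() if p not in set(muon_params)]

muon_opt = SimpleMuon(muon_params, lr=0.02)
adamw_opt = torch.optim.AdamW(adamw_params, lr=3e-4)
```

In a real distributed setting the gradients are sharded (e.g. under FSDP/TP), so the orthogonalization step has to gather or operate on full matrices, which is the part this PR addresses; the sketch above ignores that entirely.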
Works:
We may still need to check ETP, and it's not yet working with multiple shared_experts.
CC @janEbert @ofivite